Abstract


The  natural  environment  is  burdened  with  a  broad  range  of  toxic  chemicals,  and 
there  is  a  need  to  develop  a  tool  that  can  accelerate  the  pace  at  which  we  learn  how 
chemicals  impact  disease.  This  work  developed  an  artificial  neural  network  (ANN) 
based  model  that  constructed  chemical-disease  relationships  for  chemicals  found  in  the 
Comparative  Toxicogenomics  Database.  A  new  chemical  classification  system,  based  on 
the  molecular  weight,  hydrogen  donors,  and  hydrogen  acceptors,  was  created  to  identify 
chemicals  with  a  unique  number  that  is  directly  related  to  these  structural  properties  of 
the  chemical.  Diseases  were  grouped  into  27  categories  and  the  chemical-disease 
associations  were  made  between  the  chemical  and  its  associated  disease  category.  The 
ANN model was successfully trained and tested to associate 75 chemicals with the 27
disease  categories.  Simulations  with  training-validation-testing  ratios  of  70-15-15  percent 
produced  coefficients  of  determination  equal  to  0.99,  and  the  Levenberg-Marquardt 
backpropagation  function  provided  the  best  network  performance.  To  help  validate  the 
model,  the  ANN  was  also  used  to  evaluate  chemical-disease  relationships  for  three 
uncurated  chemicals.  Results  showed  that  ANNs  have  the  potential  to  predict  disease 
associations  for  uncurated  chemicals  and  to  guide  research  for  curated  chemicals  that 
may require further toxicological testing.

ARTIFICIAL NEURAL NETWORK PREDICTION OF CHEMICAL-DISEASE
RELATIONSHIPS  USING  READILY  AVAILABLE  CHEMICAL  PROPERTIES 


I.  Introduction 

The  natural  environment  is  burdened  with  a  broad  range  of  toxic  chemicals, 
including  petroleum  products,  metals,  pesticides,  pharmaceutical  compounds,  organic 
solvents,  and  numerous  other  hazardous  substances.  Most  of  these  chemicals  have  the 
potential  to  cause  ecological  harm  and  they  also  pose  significant  risks  to  human  health. 
Toxicological  testing  has  helped  reveal  the  connections  between  specific  chemicals  and 
health  risk  factors,  but  experimental  testing  on  indicator  species  is  expensive  and  time 
consuming,  while  testing  on  humans  is  illegal  and  unethical.  There  is  a  need  to  develop  a 
tool  that  can  accelerate  the  pace  at  which  we  learn  how  chemicals  impact  disease.  Such  a 
tool  would  allow  the  benefits  of  a  given  chemical  to  be  weighed  against  the  risks  to  the 
environment  and  public  health. 

Risks  to  the  environment  and  to  public  health  are  governed  by  the  interactions 
between  chemicals,  environmental  factors,  and  the  genes  that  modulate  important 
physiological  processes,  and  there  are  large  databases  containing  information  that  can  be 
used  to  advance  fundamental  understanding.  For  example,  the  Comparative 
Toxicogenomics  Database  (CTD)  is  a  publicly  available  research  resource  that  includes 
curated  data  describing  cross-species  chemical-gene/protein  interactions  and  chemical- 
and  gene-disease  associations.  The  CTD  contains  over  800,000  chemical-gene 
interactions, more than 12,400,000 gene-disease associations, and over 1,300,000
chemical-disease associations (Davis et al., 2013). These data can be used to develop
insights  into  complex  chemical-gene  and  protein  interaction  networks. 

The existing CTD data can be used to develop a model that can predict the effect
of chemical structure on public health risk. Artificial neural networks (ANNs) are well
suited to such a model. ANNs are flexible mathematical models that are capable of
identifying complex nonlinear relationships between input and output data sets. These
models are especially useful when the relationships are too difficult to describe with
conventional mathematical equations. ANNs work by converting input data into
numerical values that are propagated through a network of neurons. The network
processes these values, searching for patterns in the data
so that inputs can be correlated with outputs. ANNs have been used for a wide range of
environmental and public health applications, and they are well suited to problems where
a large amount of data is available for ANN development.

One  obstacle  in  investigating  chemical-disease  relationships  is  the  lack  of  a  useful 
chemical  classification  system;  one  that  uses  specific  chemical  characteristics  to  assign 
chemical  identification  numbers.  Currently,  several  individual  classification  systems 
provide  unique  classification  numbers  for  each  chemical;  however,  these  numbers  are  not 
related  to  the  properties  of  the  chemical  and  often  are  randomly  assigned.  Therefore, 
developing  a  modified  chemical  classification  system  is  an  important  task  for  the 
development  of  chemical-disease  relationships.  It  would  permit  policy  makers  and 
scientists to anticipate diseases that are likely to be associated with new chemicals or
existing chemicals that require further testing.

The Environmental Protection Agency (EPA) and National Science Foundation
(NSF) have expressed interest in chemical-disease relationships for the purpose of
characterizing chemical lifecycles. In 2013, as part of a joint solicitation, the EPA and
NSF requested research on the lifecycles of synthetic chemicals,
including a focus on their impacts on human health and ecology (National Science
Foundation, 2013). ANNs could provide an appropriate tool for investigating chemical
lifecycles, especially when analyzing chemical-disease associations. Understanding how
chemicals interact in a given environment and how they could affect their surroundings
is part of characterizing the lifecycle of a chemical.

The EPA could also use an ANN tool to add important chemical association
information to the Toxic Substances Control Act (TSCA) inventory. When the TSCA
was implemented in 1976, over 62,000 chemicals were grandfathered into the inventory
without any knowledge of their potential effects (Environmental Protection Agency,
2013). In the past 38 years, the number of chemicals in the TSCA inventory has grown
to over 84,000, yet only 4 chemicals are specifically addressed within the TSCA
document and only a few others have been regulated or banned in the United States
(Congressional Digest, 2010). With so many chemicals in the TSCA inventory having
unknown chemical-disease associations, a simple analytical tool that generates
chemical-disease association predictions may provide valuable information about
chemicals whose potential for harm is unknown. Using a predictive ANN to generate
potential chemical impacts could increase the usefulness of the TSCA.

Significance of Research


With  the  ability  to  generate  predictions  of  unknown  chemical-disease 
relationships,  the  ANN  provides  the  possibility  of  being  a  useful  tool  for  researchers  in 
the  science  and  medical  fields  investigating  the  potential  effects  of  new  chemicals.  A 
network  that  could  point  researchers  towards  the  effects  a  chemical  will  have  could  help 
save  valuable  time  and  resources  when  it  comes  to  creating  and  testing  chemicals.  When 
used  as  a  screening  and  prioritization  tool,  an  ANN  may  be  useful  in  influencing  where 
researchers begin testing and analyzing chemicals. As the network is expanded through
future research, it could be used to predict interactions other than
chemical-disease associations. Refining the classification number and training the
network with different outputs could allow the network to predict how chemicals may
interact if released into a natural environment. An ANN with this type of capability
could  be  adjusted  to  work  with  the  Environmental  Protection  Agency  and  National 
Science  Foundation’s  research  request  for  using  networks  to  characterize  the  lifecycle  of 
chemicals.  Additionally,  valuable  information  could  be  added  to  the  TSCA  inventory 
providing  data  on  potentially  harmful  chemicals  which  were  grandfathered  into  the 
system  with  no  known  associations.  The  true  significance  of  using  an  ANN  to  predict 
chemical-disease associations will become more evident as further research and testing are
done to refine the ANN model. As the model becomes more efficient and produces more
accurate results, it will be more useful to the scientific community in the screening of
new chemicals.

Implications of Research


After investigating the research objectives and analyzing the results of the ANN
simulations, it can reasonably be assumed that a MATLAB ANN can be used to analyze
chemical and disease data and formulate a network that can possibly predict future
chemical-disease associations. The creation and use of a new chemical classification
system with an ANN was also demonstrated, and the results show that a new classification
method could be advantageous when working with chemical-disease interactions.
Although the classification system developed worked for the simulations conducted in
this research, this does not mean that the classification represents the best method or uses
the most appropriate chemical properties in the classification number. However, it does
indicate that a classification number based on chemical attributes is feasible
and could be useful in research and experimentation. Similar to the classification system, the
ANN shows the potential for developing networks that can predict chemical-disease
relationships; however, the current network may not provide the best performance
possible. Training-validating-testing (TVT) ratios and training functions play important
roles in the development of the ANN and show strong correlation with how well the
network performs, but there are many other factors that can be adjusted and tested that
could improve the network further. Using data from the CTD shows that a network could be
created on a larger scale and not be bound to specific groups of chemicals or diseases.
Combining the CTD data with the new classification system and ANN confirms that
chemical-disease association prediction can be accomplished on a large scale, not just in
smaller quantitative structure-activity relationship research studies.

II. Literature Review


The  relationships  between  chemical  properties  and  disease 

Several studies have explored the relationship between chemical properties and
health risk factors or physiological impacts on indicator species. For example, Schultz et
al., 2002 examined 120 different aromatic compounds and found that estrogenicity was
positively related to the number of hydrogen bond donating groups in the aromatic
compound. They also found that the number of hydrogen bond accepting groups was
negatively linked to estrogenicity. Fang et al., 2001 looked at 230 natural and synthetic
steroids and discovered that estrogenicity related negatively to the number of hydrogen
bond donating groups in the steroid. They also discovered that estrogenicity was
positively linked to the octanol-water partition coefficients of the steroids. Lipinski et al.,
2012 expanded on this research by looking at 2500 organic compounds and found
results similar to those of Fang et al. Quantitative structure-activity relationships are not
limited to estrogenicity; they have been used to model numerous other
chemical-disease relationships. Ren, 2002 found that hydrophobicity and hydrogen bonds could be
used to predict the toxicity of a chemical, and Svetnik et al., 2003 determined that molecular
weight could be used to predict a chemical’s biological activity. Wu et al., 2013 also
discovered that hydrophobicity and electron density can predict antibacterial qualities of
chemicals. This previous work shows that structural properties of chemicals can be used
to predict associations between chemicals and other factors.

Using Artificial Neural Networks in medical research


ANNs are flexible mathematical models capable of identifying complex,
nonlinear relationships in data sets. They are capable of discovering patterns in large
amounts of data and have been shown to be useful in environmental and public health
applications (Beale et al., 2013). ANNs take a set of input and output data and develop
correlations between the two data sets by using hidden layers of mathematical formulas,
weights, and biases. The formulas are determined by the type of training function
specified for the network during the simulation. The weights and biases are
applied to the input data as the network is trained, and they are adjusted to improve
network performance. After processing the known input data with the training
function formulas, weights, and biases, the network derives outputs that are compared to
the actual outputs.
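
The flow described above (inputs combined with weights and biases, passed through
hidden-layer formulas, and the resulting outputs compared to the actual outputs) can be
illustrated with a minimal Python sketch. It is not the MATLAB network used in this work:
the layer sizes, the tanh hidden activation, the linear output layer, the mean squared
error comparison, and every numerical value are assumptions chosen only for illustration.

```python
import math


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """One forward pass: a single tanh hidden layer followed by linear outputs."""
    # Each hidden neuron forms a weighted sum of the inputs, adds its bias,
    # and applies the (assumed) tanh activation.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Each output neuron forms a weighted sum of the hidden values plus its bias.
    return [
        sum(w * h for w, h in zip(weights, hidden)) + bias
        for weights, bias in zip(output_weights, output_biases)
    ]


def mean_squared_error(predicted, actual):
    """Average squared difference between derived and actual outputs."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Made-up example: 2 inputs, 2 hidden neurons, 1 output.
hidden_weights = [[0.5, -0.3], [0.1, 0.8]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.0, -1.0]]
output_biases = [0.2]

predicted = forward([1.0, 2.0], hidden_weights, hidden_biases,
                    output_weights, output_biases)
error = mean_squared_error(predicted, [0.0])
print(predicted, error)
```

A training function would repeatedly adjust the weights and biases to reduce this error;
that adjustment step is omitted from the sketch.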

When setting up an ANN, two important parameters of a network are the TVT ratio
and the training function. TVT ratios establish how the data are divided for use in training,
validating, and testing the network model. Training functions are the algorithms that
determine how the network trains on the data while it attempts to find patterns and
correlations between the input and output data (Beale et al., 2013). The use of
appropriate TVT ratios is important for optimizing network performance because the
ratios will determine whether a network is undertrained or overtrained. Seguritan et al., 2012
found that testing different training and validation ratios did not produce any significant
difference in the overall network performance, but adjusting the testing ratio did show
potential for increasing performance. Ahmad and Gromiha, 2003 calculated high ANN
prediction accuracy rates when using TVT ratios that devoted the majority of the
data to training the network, and Guyon, 1997 found that the ratio of validation data to
training data should be between 10 and 25 percent. Singh et al., 2011 compared three
training functions in a neural network and found that the trainbr function provided the best
network performance, but noted that more than three functions should be tested to truly
find the function that best fits the network. Guenther and Frauke, 2010 showed that resilient
backpropagation functions performed well in regression ANNs but indicated that only three
types of functions were tested and that other functions may provide similar or better results.
Ferrari and Stengel, 2005 found that algebraic training functions may be used to create
linear correlations from non-linear datasets with multiple input and output variables.
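
To make the idea of a TVT ratio concrete, the following Python sketch randomly divides
sample indices into training, validation, and testing subsets. The function name
split_tvt, the fixed random seed, and the rounding of subset sizes are assumptions for
illustration and do not reproduce MATLAB's own data division routine; the 70-15-15 ratio
and the count of 75 chemicals are the values reported in the abstract.

```python
import random


def split_tvt(n_samples, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly divide sample indices into training, validation, and testing sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    # A seeded generator keeps the illustrative split reproducible.
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * ratios[0])
    n_val = round(n_samples * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    # Whatever remains after training and validation goes to testing.
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_tvt(75)
print(len(train), len(val), len(test))
```

Because the subset sizes are rounded, the testing subset absorbs any remainder, so the
three subsets always cover every sample exactly once.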

Overall, research has shown that ANNs can be useful in diagnostic and predictive
applications when provided the proper data. ANNs have been used in the medical
community to address concerns related to specific diseases or groups of diseases. For
example, Stephan et al., 2009 used ANNs to distinguish between benign and malignant
prostate cancer, and Santos-Garcia et al., 2004 used ANNs to predict morbidity from
cardiorespiratory failure following pulmonary resection for non-small cell lung cancer.
Curtis et al., 2001 used ANNs to associate genotypes with common human diseases, and
Sheppard et al., 1999 used ANNs to predict the risks of contracting cytomegalovirus
disease after kidney transplantations. Nguyen et al., 2002 used ANNs to predict patient
susceptibility to meningitis.

Nearly all of the data used in ANN analyses for clinical and medical research
are hard to obtain or require a great deal of effort to acquire.
Gathering such data often requires additional testing and information gathering, which
demands a great deal of time and resources from
medical personnel. For example, Song et al., 2005 used ultrasound image results and
interpretations from physicians to investigate the ANN diagnosis of breast masses, and
Viazzi et al., 2006 had to obtain cardiac and vascular ultrasound information from
physicians and adjust it to work in the ANN model. While useful in the medical
field, many ANN applications require additional data or testing to use the
network successfully.

ANN  in  diagnosis 

ANNs  have  shown  potential  to  be  used  in  helping  doctors  diagnose  lung  diseases 
by  analyzing  clinical  and  radiological  factors  in  addition  to  relying  on  chest  radiographs. 
Abe  et  al.,  2002  and  Abe  et  al.,  2004  presented  evidence  that  radiologists  could  use  ANN 
output  data,  in  addition  to  x-rays,  to  diagnose  lung  diseases.  Their  findings  indicated  that 
using  clinical  factors  in  an  ANN  could  potentially  prove  to  be  more  useful  when 
diagnosing interstitial lung disease. Ashizawa et al., 1999 also found that ANNs used by
radiologists increased the accuracy of lung disease diagnosis. Research performed by
Feng et al., 2012 found that an ANN was capable of diagnosing lung cancer as well as
differentiating it from benign lung disease, gastrointestinal cancers, and control patients
by analyzing various blood levels in patients.

ANNs are not limited to diagnosing lung disease. In 2009,
Babaoglu et al. concluded that an ANN could be used to analyze exercise stress testing data
to correctly diagnose coronary artery disease as well as predict the locations of lesions near
the heart. Lux et al., 2013 found that hereditary hemorrhagic telangiectasia could be
diagnosed by obtaining mid-infrared spectroscopy data from blood plasma and analyzing the
data through an ANN instead of conducting the typical and costly clinical tests. Matsuki
et al., 2002 found that by taking clinical parameters and radiologic findings from
high-resolution CT scans and analyzing the data with an ANN, radiologists could
accurately diagnose nodules as benign or malignant without having to conduct further
invasive testing on the patients. Raoufy et al., 2011 predicted arterial blood gas values from
venous blood gas values in an effort to better assess patients with acute exacerbations of
chronic obstructive pulmonary disease (AECOPD). Arterial blood
gas values provide the best diagnostic evidence of AECOPD but can be difficult to
obtain. Using an ANN to correlate venous blood gas values to arterial blood gas values
provided an accurate method to detect AECOPD hypercarbia. Deng et al., 1999 found
that when ANNs were combined with MRIs, physicians were able to successfully
diagnose Alzheimer’s disease in potential patients, and Matake et al., 2006 found that the
accuracy of radiologist diagnosis of hepatic masses increased when ANNs were used to
analyze nine clinical parameters from computed tomographic scans. Hamilton et al.,
2006 successfully used ANNs to discriminate between parkinsonian syndrome and
essential tremor based on the ratio of tracer accumulation between the caudate nucleus
and putamen. This model provided a tool to diagnose parkinsonian syndrome early
without confusing it with essential tremor.

ANN in predictions


In addition to aiding medical professionals in diagnostics, ANNs have also been
shown to be useful in predicting diseases. The difference between predicting and
diagnosing diseases is that diagnosing identifies a disease or condition that is already
present, whereas predicting uses current data to forecast potential future diseases or
conditions before any symptoms appear. For example, Biglarian et al., 2012 found that ANNs
were useful in predicting distant metastasis in colorectal cancer patients and that the ANN
models were more accurate than logistic regression models. Colak et al., 2008 created an
ANN model that showed promising results for predicting coronary artery disease
without the need for invasive diagnostic testing. The work of Cucchetti et al., 2007
demonstrated that ANNs were more accurate than the current model for end-stage liver
disease used to prioritize patients with liver cirrhosis for donor organs, and Ghoshal et al.,
2008 used an ANN to predict the mortality of patients with cirrhosis of the liver. Using
ANNs for this purpose could help doctors better prioritize transplant candidates, potentially
reducing the mortality rate of patients waiting for donor organs. Dagli et al., 2012 used an
ANN to predict anemia in patients with Behcet disease. Their model provided a 99%
correct anemia prediction rate using prohepcidin and hepcidin levels as well as several
other common blood parameters. El-Solh et al., 1999 found that ANNs were more
accurate in predicting pulmonary tuberculosis than medical assessments performed by
physicians. Fujikawa et al., 2003 predicted recurrence of non-invasive transitional cell
carcinoma of the urinary bladder with an ANN model, and the model proved to be
more accurate than current prognosis techniques. Matsui et al., 2002 developed an ANN
model that showed promising results for being able to replace old methods of predicting
prostate cancer in Japanese men. Although in need of further refinement, the model
could be used to predict the pathological stage of prostate cancer. Ning et al., 2006
predicted levels of hypertension using physician and patient comment data in an ANN.
ANNs were shown by Orunesu et al., 2004 to be a possible tool for predicting Graves’
disease in patients following antithyroid drug withdrawal, and Salvi et al., 2002 found
that ANNs could be used to predict the progression of thyroid-associated ophthalmopathy.

The articles discussed in this literature review represent a sample of the articles
that can be found on these topics. The review is meant to provide an understanding of
previous work conducted with chemical-disease relationships and ANNs in the scientific and
medical research communities by showcasing relevant research. From analyzing
previous research related to chemical-disease relationships and the use of ANNs in
investigating chemical-disease associations, it can be concluded that with proper set-up,
ANNs can be used to predict potential disease associations from a variety of inputs.
These previous efforts help to identify and define the research objectives for optimizing
ANN performance and then using the network to predict chemical-disease associations.

III. Research Objectives


From the analysis of the literature review, several areas of research became
apparent that would attempt to solve the issues presented in Chapter 1. The research
objectives of this thesis are as follows:

1)  What  chemical  structural  characteristics  should  be  used  to  create  a  new 
chemical  classification  system  capable  of  being  used  in  identifying  chemical- 
disease  associations?  The  first  problem  addressed  will  be  the  creation  of  a 
new  chemical  classification  system.  Reviews  of  previous  research  efforts  will 
be  used  to  highlight  potential  chemical  properties  that  could  be  successfully 
used  to  create  a  new  numbering  system. 

2) How can MATLAB ANN capabilities be used to connect chemicals to
diseases using chemicals’ structural properties? Before a predictive ANN can
be  created  in  MATLAB,  it  needs  to  be  shown  that  the  MATLAB  ANN  can 
properly  analyze  the  input  and  output  data  of  the  new  chemical  classification 
system  obtained  from  the  CTD.  Showing  that  the  MATLAB  ANN  can  be  used 
for  this  purpose  establishes  the  foundation  needed  to  continue  on  to  the  next 
phase  of  research. 

3) What TVT ratio provides the best network performance? Investigating the
best TVT ratio to use with the chemical and disease data in the network is
important so that the network performs at the highest level possible. If a
suboptimal TVT ratio is used, the predictive ability of the network will suffer.

4)  What  training  function  provides  the  best  network  performance?  A  proper 
training  function  used  in  the  network  is  equally  as  important  as  a  proper  TVT 
ratio.  The  training  function  establishes  the  algorithm  the  network  will  use  to 
train  the  data  and  update  the  weights  and  biases  of  the  network.  There  are 
several  training  functions  to  choose  from  in  the  MATLAB  ANN  and  each  one 
trains the data differently. Determining the proper training function increases the
prediction potential of the network.

5)  How  can  the  network  be  used  to  predict  diseases  that  are  linked  to  uncurated 
chemicals?  The  final  step  in  addressing  the  problem  statement  is  determining 
if  the  network  can  be  used  to  accurately  predict  disease  outputs.  This  will  be 
done by creating an ANN with known chemical-disease relationships using
curated chemical data from the CTD. Then, uncurated chemicals will be
entered into the system and the outputs will be analyzed to see if the model
can generate valid predictions.

IV. Methodology


Overview 

This  chapter  will  discuss  the  methods  used  to  acquire,  test,  and  analyze  the  data  in 
attempting  to  show  that  a  predictive  ANN  model  can  be  created  to  correlate  chemicals  to 
unknown  disease  associations.  The  data  source  and  required  data  information  will  be 
explained  as  well  as  how  the  ANN  will  be  used  to  analyze  the  data.  The  three 
simulations  needed  to  complete  the  research  will  be  addressed  as  well  as  how  the  output 
data  will  be  analyzed  to  obtain  the  final  results. 

Chemical  Classification  System 

A  new  numbering  system  was  created  that  attempted  to  incorporate  specific 
qualities  of  the  chemical  into  the  classification  number  while  still  ensuring  that  each 
chemical would have an individual and unique number. In previous
research on quantitative structure-activity relationships, several studies demonstrated that
readily available physical properties of chemicals could be used to predict a chemical’s
effect when used in an appropriate model (Schultz et al., 2002, Fang et al., 2001, Lipinski
et al., 2012, Ren, 2002, Svetnik et al., 2003). Using readily available chemical properties
allows a chemical classification number to be created without extensive
testing or data collection.

Three chemical traits that proved useful, particularly for a chemical’s effect on
estrogenicity, were molecular weight, hydrogen donors, and hydrogen acceptors (Lipinski
et al., 2012). In addition to being successfully used in previous research, these three
chemical traits are also fairly simple to obtain. Without performing complex testing, the
molecular weight can be calculated from the chemical formula, while the hydrogen
acceptors and donors are counted from the atoms that carry lone-pair electrons and the
atoms bonded to at least one hydrogen atom, respectively. For example, the oxygen atom
in water carries lone-pair electrons and is counted once, so water has one hydrogen
acceptor. The oxygen atom in water is also bonded to two hydrogen atoms; because donors
are counted per atom bonded to hydrogen, water has one hydrogen donor.

The numbering system gave each chemical a ten-digit number that was
exclusive to that specific chemical. The first six digits of the number correspond to the
molecular weight of the chemical, including two decimal places. The seventh and eighth
digits represent the number of hydrogen acceptors and the ninth and tenth digits represent
the number of hydrogen donors. Figure 1 shows an example of how the new chemical
classification number is created for water. Water has a molecular weight of 18.015
g/mol, 1 hydrogen acceptor, and 1 hydrogen donor. Rounding the molecular weight to
18.02 and entering these numbers into the new classification system gives water the
number 0018020101.

Figure 1: New Chemical Classification Example

The molecular weight in the classification number is carried to the hundredths place to
provide sufficient accuracy and uniqueness for identifying the chemical. The numbers of
hydrogen acceptors and hydrogen donors further reduce the possibility that two different
chemicals would have the same classification number.
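
The numbering scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in this work: the function name classification_number is hypothetical, and half-up rounding of the molecular weight is an assumption inferred from the water example.

```python
from decimal import Decimal, ROUND_HALF_UP


def classification_number(molecular_weight, acceptors, donors):
    """Build the ten-digit classification number described in the text.

    molecular_weight is passed as a string (e.g. "18.015") so that rounding
    to two decimal places is exact and not affected by binary floating point.
    """
    # Round to the hundredths place (half-up rounding is an assumption).
    mw = Decimal(molecular_weight).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # Drop the decimal point: 18.02 -> 1802, later padded to six digits.
    mw_digits = int(mw * 100)
    if mw_digits > 999999 or acceptors > 99 or donors > 99:
        raise ValueError("value does not fit the ten-digit format")
    return f"{mw_digits:06d}{acceptors:02d}{donors:02d}"


print(classification_number("18.015", 1, 1))    # water -> 0018020101
print(classification_number("1202.61", 12, 5))  # cyclosporine -> 1202611205
```

Passing the molecular weight as a string is a design choice made here so that a value such as 18.015 rounds the same way it would by hand.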

Data 


The data used in the MATLAB simulations come from the online CTD. The CTD is developed
and maintained through a joint effort between North Carolina State University and Mount
Desert Biological Laboratory and also receives financial support from the National
Institute of Environmental Health Sciences. The primary goal of the CTD is to advance the
understanding of the effects of environmental chemicals on human health by studying the
relationships between chemicals, genes, and diseases. This online database is a
collection of curated and uncurated data containing information for chemicals, genes, and
diseases. Curated data are supported by peer-reviewed scientific research that
demonstrates the data relationship (i.e., a chemical-disease association). Uncurated data
do not have any literature showing interactions with other factors. The CTD contains over
800,000 chemical-gene associations, 12,400,000 gene-disease associations, and 1,300,000
chemical-disease associations, with additional data and updates made weekly. Chemicals,
genes, and diseases that are curated have been organized based on documented scientific
research. Uncurated chemicals, genes, and diseases do not have the support of
peer-reviewed research. While uncurated data do not have documentation to prove
associations, the CTD does list possible associations through inferences. An inferred
association between a chemical and a disease is established from a curated chemical-gene
relationship and a curated gene-disease relationship. For example, acetone has a curated
relationship with the catalase (CAT) gene, and the CAT gene has been shown to affect
asthma in humans. There is no direct link between acetone and asthma, but one can be
inferred through the curated relationships with the CAT gene (Davis et al., 2013).
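
The inference step described above amounts to joining two lists of curated pairs on the shared gene. The following Python sketch illustrates the idea with the acetone example; the function name and the data layout are hypothetical and do not reflect how the CTD itself stores or computes inferences.

```python
def infer_associations(chemical_gene, gene_disease):
    """Join curated chemical-gene and gene-disease pairs on the shared gene."""
    inferred = set()
    for chemical, gene in chemical_gene:
        for other_gene, disease in gene_disease:
            if gene == other_gene:
                # No direct evidence links the pair; the link is inferred.
                inferred.add((chemical, disease))
    return inferred


# The example from the text: acetone -> CAT gene -> asthma.
curated_chemical_gene = [("acetone", "CAT")]
curated_gene_disease = [("CAT", "asthma")]
print(infer_associations(curated_chemical_gene, curated_gene_disease))
# {('acetone', 'asthma')}
```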

The CTD website offers several organization and research functions that can be used to
explore how the chemicals, genes, and diseases interact and relate to one another. These
functions allow users to research specific categories within the data and specify
particular relationships of interest. In addition to the search functions, users can
download entire sets of the database to use in simulations and research.

Chemicals 

The chemicals used in the MATLAB ANN simulations were randomly chosen from the CTD. As
chemicals were selected, they were checked to make sure each chemical was curated so that
known disease associations existed to use in the network. Chemicals that were descendants
were traced back to their most common ancestor chemical to ensure there was enough data
to use in the simulations. Once the ancestor chemicals were determined, the molecular
weight, hydrogen acceptors, and hydrogen donors were determined and the new
classification system number was generated for each chemical. Table 1 shows the list of
chemicals used in the simulations, along with the classification number, chemical
formula, and chemical structure diagram. The chemical formulas and structure diagrams
were included to show the diversity of the chemicals used.

Table 1: Chemical Information

(Chemical structure diagrams are not reproduced here.)

Chemical Name            New Classification Number   Molecular Formula
Acetone                  0058080100                  C3H6O
Aciclovir                0225200804                  C8H11N5O3
Alprazolam               0308760300                  C17H13ClN4
Ammonium Sulfate         0132140402                  H8N2O4S
Aspirin                  0180160401                  C9H8O4
Atenolol                 0226340403
Azithromycin             0748981405                  C38H72N2O12
Benzene                  0078120000                  C6H6
Benzyl Penicillin        0334390602                  C16H18N2O4S
Caffeine                 0194190300                  C8H10N4O2
Candoxatril              0515640702                  C29H41NO7
Carbamazepine            0236270101                  C15H12N2O
Sodium Hydroxide         0039990101                  HNaO
Chloramphenicol          0323130503                  C11H12Cl2N2O5
Cimetidine               0252340403                  C10H16N6S
Clonidine                0230090102                  C9H9Cl2N3
Copper Sulfate           0159610400                  CuO4S
Cyclosporine             1202611205                  C62H111N11O12
Desipramine              0266380201                  C18H22N2
Dexamethasone            0392460603                  C22H29FO5
Diazepam                 0284740200                  C16H13ClN2O
Diclofenac               0296150302                  C14H11Cl2NO2
Diltiazem-HCl            0414520600                  C22H26N2O4S
Doxorubicin              0543521206                  C27H29NO11
Enalaprilat              0376450602                  C20H28N2O5
Erythromycin             0733931405                  C37H67NO13
Ethylene Glycol          0062070202                  C2H6O2
Famotidine               0337450804                  C8H15N7O2S3
Felodipine               0384250501                  C18H19Cl2NO4
Ferric Chloride          0162200000                  Cl3Fe
Fluorouracil             0130080302                  C4H3FN2O2
Flurbiprofen             0244260301                  C15H13FO2
Formaldehyde             0030030100                  CH2O
Furosemide               0330740703                  C12H11ClN2O5S
Gabapentin               0171240302                  C9H17NO2
Glycerol                 0092090303                  C3H8O3
Hydrobromic Acid         0080910000                  BrH
Hydrochloric Acid        0036460001                  HCl
Hydrochlorothiazide      0297740703                  C7H8ClN3O4S2
Hydrofluoric Acid        0020010101                  FH
Ibuprofen                0206280201                  C13H16O2
Imipramine               0280410200                  C19H24N2
Isopropyl Alcohol        0060100101                  C3H8O
Itraconazole             0705630900                  C35H38Cl2N8O4
Ketoconazole             0531430600                  C26H28Cl2N4O4
Ketoprofen               0254280301                  C16H14O3
Labetalol-HCl            0328410404                  C19H24N2O3
Lisinopril               0405490704                  C21H31N3O5
Magnesium Sulfate        0120370400                  MgO4S
Mannitol                 0182170606
Methotrexate             0454441205                  C20H22N8O5
Metoprolol               0267360402                  C15H25NO3
Nadolol                  0309400504                  C17H27NO4
Naloxone                 0327370502                  C19H21NO4
Naproxen-sodium          0230260301                  C14H14O3
Nortriptylene-HCl        0263380101                  C19H21NO4
Omeprazole               0345420601                  C17H19N3O3S
Phenytoin                0252270202                  C15H12N2O2
Piroxicam                0331350702                  C15H13N3O4S
Potassium Bromide        0119000100
Potassium Permanganate   0158030400                  MnO4K
Prazosin                 0383410801                  C19H21N5O4
Propranolol-HCl          0259350302                  C16H21NO2
Quinidine                0324430401                  C20H24N2O2
Ranitidine-HCl           0314410702
Silver Nitrate           0169870300                  AgNO3
Sodium Thiosulfate       0158110400                  Na2O3S2
Tenidap                  0320760402                  C14H9ClN2O3S
Terfenadine              0471690302                  C32H41NO2
Testosterone             0288430201                  C19H28O2
Trovafloxacin            0416361002                  C20H15F3N4O3
Valproic-acid            0144220201                  C8H16O2


Diseases 


Due to the large number of diseases present in the CTD, the diseases were combined into
27 groups based on the classifications used in the CTD. Rather than associating each
chemical with every associated disease, the chemicals were related to the disease group
that contained the actual associated disease. Each disease group was assigned a number,
1-27, to represent that disease group in the network. Table 2 shows the 27 disease groups
used in the network and the disease identification number assigned to each group.

Table 2: Disease Groupings

Disease Category                                               Disease Number
Animal Diseases                                                1
Bacterial Infections and Mycoses                               2
Cardiovascular Diseases                                        3
Congenital, Hereditary, Neonatal Diseases and Abnormalities    4
Digestive System Diseases                                      5
Environmental Disorders                                        6
Endocrine System Diseases                                      7
Eye Diseases                                                   8
Female Urogenital Diseases and Pregnancy Complications         9
Hemic and Lymphatic Diseases                                   10
Immune System Diseases                                         11
Male Urogenital Diseases                                       12
Mental Disorders                                               13
Musculoskeletal Diseases                                       14
Neoplasms                                                      15
Nervous System Diseases                                        16
Nutritional and Metabolic Diseases                             17
Occupational Diseases                                          18
Otorhinolaryngologic Diseases                                  19
Parasitic Diseases                                             20
Pathological Conditions, Signs and Symptoms                    21
Respiratory Tract Diseases                                     22
Skin and Connective Tissue Diseases                            23
Stomatognathic Diseases                                        24
Substance-Related Disorders                                    25
Virus Diseases                                                 26
Wounds and Injuries                                            27

Artificial Neural Network


Figure 2 shows an illustration of how the ANN operated for the MATLAB simulations. The
ANN took the input and output data for each chemical-disease association and used the
training function in the hidden layer to update the weights and biases in an attempt to
find patterns and correlations between the input and output data. Based on the
pre-determined values in the TVT ratio, the ANN would select the designated amount of
data to train the network with. After training, the ANN would then validate the network
with the designated amount of data, and finally test the network with the remaining data.
In Figure 2, the three output categories (species, dummy variable, and disease) within
the ANN represent the actual disease outputs obtained from the CTD. The outputs on the
outside of the ANN represent the outputs generated by the ANN during the testing phase of
each simulation. The code used in the network simulations can be found in Appendix A.

Figure 2: Artificial Neural Network Example

Input and Output Data

Before  running  the  network  simulations  in  MATLAB,  the  data  were  first 
formatted  to  fit  the  ANN  requirements  as  defined  by  the  MATLAB  user  guide.  The  input 
and  actual  associated  output  data  from  the  CTD  were  entered  into  two  matrices  created  in 
Microsoft  Excel  from  which  MATLAB  was  coded  to  retrieve  the  data.  The  input  data 
were  entered  into  a  matrix  with  three  columns,  one  each  for  molecular  weight,  hydrogen 
acceptors,  and  hydrogen  donors.  These  three  chemical  characteristics  were  left  as 
individual  data  points  for  the  input  data,  rather  than  being  entered  in  one  column  as  the 
new  classification  number,  to  keep  the  size  of  the  input  and  output  matrices  the  same  and 
reduce  the  use  of  dummy  variables.  The  output  data  were  entered  into  a  matrix  the  same 
size  as  the  input  data  matrix  with  columns  for  species,  dummy  variable,  and  disease 
group.  Species  were  given  numbers,  similar  to  the  disease  groups,  and  the  species  and 
their assigned numbers can be found in Table 3. The species number is specifically
related  to  the  species  that  the  chemical-disease  association  occurs  in.  For  example, 
acetone  has  a  curated  relationship  with  neoplasms  (disease  category  15)  and  this 
chemical-disease  association  occurs  in  humans.  A  dummy  variable  was  used  in  the 
output  matrix  to  keep  the  matrix  the  same  size  as  the  input  data  matrix.  Zeros  were 
entered  for  the  dummy  column  values  and  the  dummy  column  was  not  used  in  any  of  the 
results  analysis. 


Table 3: Species Table

Species      Species Number
Humans       1
Dogs         2
Fish         3
Birds        4
Rats/mice    5

The  number  of  rows  used  for  each  chemical  was  determined  by  the  number  of 
disease  groups  the  chemical  was  associated  with.  For  example,  acetone  was  associated 
with six disease groups, so it used six rows in both the input and output matrices. The
actual  input  and  output  data  used  for  acetone  can  be  seen  in  Table  4.  The  six  rows  used 
in  the  input  matrix  all  contained  the  same  molecular  weight,  hydrogen  acceptor,  and 
hydrogen  donor  data.  Each  of  the  six  rows  in  the  output  matrix  corresponded  to  one  of 
the  associated  disease  groups  and  subsequent  related  species.  The  complete  input  and 
output  matrices  can  be  found  in  Appendix  B. 




Table 4: Acetone Input and Output Matrices

Inputs                                                    Outputs
Molecular Weight   Hydrogen Acceptors   Hydrogen Donors   Species   Dummy   Disease Group
58.08              1                    1                 1         0       9
58.08              1                    1                 1         0       12
58.08              1                    1                 1         0       15
58.08              1                    1                 1         0       16
58.08              1                    1                 1         0       17
58.08              1                    1                 1         0       25
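
The one-row-per-association layout described above can be sketched as follows. This is a hypothetical Python illustration, not the Excel and MATLAB workflow actually used; the function name build_rows is an assumption, and the acetone values are copied from Table 4.

```python
def build_rows(molecular_weight, acceptors, donors, associations):
    """Return (inputs, outputs) with one row per chemical-disease association.

    associations is a list of (species_number, disease_group) pairs.
    """
    inputs = []
    outputs = []
    for species, disease_group in associations:
        # Every row for a chemical repeats the same three input values.
        inputs.append([molecular_weight, acceptors, donors])
        # The middle column is the zero-filled dummy variable.
        outputs.append([species, 0, disease_group])
    return inputs, outputs


# Acetone values as listed in Table 4 (species 1 = humans).
acetone_groups = [9, 12, 15, 16, 17, 25]
acetone_inputs, acetone_outputs = build_rows(
    58.08, 1, 1, [(1, group) for group in acetone_groups]
)
for in_row, out_row in zip(acetone_inputs, acetone_outputs):
    print(in_row, out_row)
```

Because the input rows for a chemical are identical, the network sees the same input mapped to several different disease groups, which is why each uncurated chemical was later run through the network more than once.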

Simulations 


Three phases of simulations were used to test the ANN. The first phase of simulation was
conducted to demonstrate that the MATLAB ANN could be used to analyze the chemical and
disease data and find appropriate correlations. Initially, 20 chemicals were chosen from
the CTD and the input and output matrices were created based on the chemical
characteristics and associated species and disease groups. This simulation used the basic
MATLAB ANN code formatted with a TVT ratio of 80-10-10 percent and the default training
function. Results from the first phase simulation can be found in Figures 3-9 in
Chapter 4.

The second phase of simulations involved testing different TVT ratios and training
functions on the network to determine which provided the best network performance. To do
this, an additional 55 chemicals were chosen at random from the CTD and the appropriate
chemical, species, and disease group data were added to the input and output matrices.
First, five different TVT ratios were tested in the network using the default training
function: 50-25-25, 60-20-20, 70-15-15, 80-10-10, and 90-5-5 percent. These five ratios
were chosen so that an appropriate TVT ratio could be selected while also highlighting
how the different ratios affect the ANN. Next, each of the 15 training functions was
tested in the network using the 70-15-15 percent TVT ratio. The MATLAB training functions
used are preprogrammed functions built into MATLAB. All of the functions operated in a
feedforward network with backpropagation. In a feedforward network, the data are passed
through the hidden layer in a single direction from the input side to the output side.
The backpropagation step involves going back to adjust the weights and biases in the
hidden layer after comparing the actual outputs to the ANN derived outputs. Each of the
training functions is based on a gradient descent algorithm in which the training
function attempts to decrease the error in the network by adjusting the weights and
biases after each network iteration. The results from the different training functions
indicated that the trainlm (Levenberg-Marquardt backpropagation) function generated the
best network performance.
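
To make the TVT ratio concrete, the Python sketch below divides the row indices of a data set into training, validation, and testing subsets. It is an illustration only: the function name tvt_split is hypothetical, and random assignment of rows is an assumption about how the data division behaves, not a description of the MATLAB code in Appendix A.

```python
import random


def tvt_split(n_samples, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly divide row indices into training, validation, and testing sets."""
    indices = list(range(n_samples))
    # A fixed seed keeps the illustration reproducible.
    random.Random(seed).shuffle(indices)
    n_train = int(round(ratios[0] * n_samples))
    n_val = int(round(ratios[1] * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    # Whatever remains after training and validation is used for testing.
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = tvt_split(200)
print(len(train_idx), len(val_idx), len(test_idx))  # 140 30 30
```

Changing the ratios argument to, for example, (0.50, 0.25, 0.25) reproduces the other splits compared in the second phase.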

The third phase of simulations involved inputting data for uncurated chemicals, which had
no known disease associations, into the network to see what diseases the network would
predict. To accomplish this, a network simulation was first run with the original 75
chemicals using the 70-15-15 percent TVT ratio and trainlm training function, which
provided the best network performance in the second phase of simulations. This
established the correlations, weights, and biases for the network to use on the uncurated
chemicals. Then the data for three uncurated chemicals were run through the network 10
times and the network-derived disease output results for each trial were recorded. Each
uncurated chemical was input into the network 10 times to allow multiple disease
association predictions to occur in the event that a chemical was associated with more
than one disease. The network outputs generated had to be rounded to correlate them
directly to the whole-number designators assigned to each disease group. Several network
outputs produced the same disease group more than once for a given input chemical; these
duplicate disease predictions were consolidated into one output. The derived disease
outputs were then compared to research literature to see if the network predictions were
correct.
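
The rounding and consolidation of the network outputs can be sketched as below. The raw values are invented for illustration, the function name is hypothetical, and discarding outputs that round to a number outside 1-27 is an assumption that the source does not state.

```python
import math


def consolidate_predictions(raw_outputs, n_groups=27):
    """Round raw network outputs to disease-group numbers and drop duplicates."""
    groups = set()
    for value in raw_outputs:
        # Round half up to the nearest whole number.
        group = math.floor(value + 0.5)
        # Assumption: outputs outside 1..n_groups are discarded.
        if 1 <= group <= n_groups:
            groups.add(group)
    return sorted(groups)


# Ten hypothetical raw outputs for one uncurated chemical.
trial_outputs = [14.6, 15.2, 3.4, 15.0, 2.9, 14.8, 3.1, 15.4, 27.9, 3.3]
print(consolidate_predictions(trial_outputs))  # [3, 15]
```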

Analysis  and  Results 

Once the simulations were complete in MATLAB, all of the output data were copied into
Microsoft Excel for analysis. The first phase of simulations was analyzed simply by
reviewing the output plots and figures generated by MATLAB at the completion of the
simulation. The analyses of the phase two simulations used Microsoft Excel to plot and
chart the output data from MATLAB. The primary graph plotted the actual disease values
versus the ANN derived disease values. When the network produced accurate output results,
the plot would follow a linear, one-to-one slope on the graph. Excel was also used to
calculate the coefficient of determination (R²) values to see how well the data followed
a linear progression. The analysis also compared the different TVT ratios and training
functions based on network parameter values to determine which ratio or function provided
the best network performance. Additionally, the effects of undertraining and overtraining
the network on ANN derived disease values were taken into account. The third phase of
simulations was primarily analyzed by comparing the ANN derived disease values to
research literature in an effort to show that the network had some predictive capability
for uncurated chemicals.
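
For readers who want to reproduce the R² check outside of Excel, a minimal Python sketch follows. It assumes that R² is taken as the squared Pearson correlation between the actual and ANN derived disease values; the sample values are invented for illustration and are not results from this work.

```python
def r_squared(actual, derived):
    """Squared Pearson correlation between actual and derived values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_d = sum(derived) / n
    # Co-variation of the two series and the variation of each on its own.
    cov = sum((a - mean_a) * (d - mean_d) for a, d in zip(actual, derived))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_d = sum((d - mean_d) ** 2 for d in derived)
    return cov ** 2 / (var_a * var_d)


# Hypothetical actual and ANN derived disease-group numbers.
actual_groups = [9, 12, 15, 16, 17, 25]
derived_groups = [9.2, 11.7, 15.1, 16.3, 16.8, 24.9]
print(round(r_squared(actual_groups, derived_groups), 3))
```

A value close to one indicates that the derived values fall near a straight line when plotted against the actual values.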
IV. Analysis and Results


ANN  Training  Results 

The following figures were generated from a typical MATLAB ANN simulation using the
default training settings and a TVT ratio of 80-10-10 percent. They show typical results
seen during the first phase of simulations, when the initial group of 20 chemicals was
used in the ANN.

Figure 3 shows a typical MATLAB training session for an ANN. The neural network diagram
shows a pictorial of how the network will function, given the requirements established in
the network code. At the top of the figure, the diagram shows the number of inputs and
outputs being used and the number of hidden layers. The data division and derivative
algorithms are preset within MATLAB, and these default settings were used for the various
simulations. The training algorithm defines the training function used in the network,
which dictates how the data will be trained. The performance algorithm measures how well
the network is operating during training. The training and performance algorithms can be
changed separately; however, a default performance algorithm will be chosen based on the
training algorithm if a specific performance algorithm is not defined.

Figure 3: Typical MATLAB ANN Training Session

The progress portion of Figure 3 shows the network performance criteria that provide
information on the progress of network training. As the simulation progresses, these
performance criteria update to show the current status of the training. The epochs show
the number of iterations over the course of the simulation. Performance shows how
accurately the network is generating output values compared to the actual output values.
The performance values are the mean squared error of the network, so lower performance
values indicate higher network training performance. The gradient measures the
adjustments made to the network in relation to the network performance during each epoch.
If a network is performing well, smaller adjustments are needed, so the gradient will be
smaller. Likewise, if a network is performing poorly, larger adjustments are needed to
attempt to improve performance, so the gradient will be larger. The mu value shows how
much the network is required to change the weights and biases placed on the input data to
achieve accurate output results. The weights and biases are used to adjust and determine
how the input data are used when training the network. If the network finds that some
input data provide better performance than other data, it will place more weight on the
data that increase performance. When a network has high mu values during training, the
network is struggling to find weight and bias values that work for the data set. The
validation checks show the number of iterations in which the simulation's validation
performance does not decrease (that is, the validation error fails to improve). After the
network is trained, it is validated to ensure the training was successful.
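
Because the performance value is a mean squared error, it can be reproduced directly from the actual and ANN derived disease numbers. The Python sketch below shows the calculation; the function name and the sample values are hypothetical.

```python
def mean_squared_error(actual, derived):
    """Average of the squared differences between actual and derived values."""
    return sum((a - d) ** 2 for a, d in zip(actual, derived)) / len(actual)


# Three hypothetical associations: two derived exactly, one off by two groups.
actual_groups = [9, 12, 15]
derived_groups = [9, 12, 17]
print(mean_squared_error(actual_groups, derived_groups))
```

A single prediction that is off by two disease groups contributes a squared error of four, so large misses dominate the performance value.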

All  of  these  performance  criteria  have  upper  and  lower  boundaries  and  when  the 
appropriate  boundary  of  one  of  the  criteria  is  reached,  the  simulation  is  terminated. 
Performance and gradient will both terminate the simulation when they reach their lower
limit  values.  If  performance  or  gradient  terminate  the  simulation,  this  generally  indicates 
a  successful  training  because  lower  performance  and  gradient  values  correlate  to  higher 
network  performance.  Epochs,  mu  and  validation  checks  will  terminate  the  simulation 
when  they  reach  their  upper  limits.  When  the  number  of  epochs  terminates  the 
simulation,  the  network  has  used  all  of  the  allotted  iterations  before  the  performance  or 
gradient  have  reached  an  acceptable  level.  Network  termination  due  to  a  high  mu  value 
indicates  that  the  network  failed  to  find  appropriate  weight  and  bias  values.  Terminating 
for  a  high  number  of  validation  checks  indicates  the  network  did  not  train  well  because  it 
was  unable  to  be  validated.  Time  is  the  only  criterion  that  does  not  terminate  a  simulation 
as  it  is  simply  meant  to  track  how  long  the  simulation  takes.  Once  the  network  training  is 
complete,  there  are  several  figures  that  MATLAB  can  create  which  provide  additional 
information  about  the  network  simulation.  Figures  4-9  show  the  different  plots  that  can 
be  generated  by  MATLAB  after  the  network  simulations.  Additional  training  sessions 
and  supporting  figures  can  be  found  in  Appendix  C. 
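
The termination logic described above can be summarized in a short Python sketch. The function name and the limit values used as defaults are illustrative assumptions, not settings taken from the simulations in this work.

```python
def termination_reason(epoch, performance, gradient, mu, validation_failures,
                       max_epochs=1000, goal=0.0, min_gradient=1e-7,
                       max_mu=1e10, max_failures=6):
    """Return the name of the first stopping criterion met, or None."""
    # Lower-limit criteria: reaching these generally indicates good training.
    if performance <= goal:
        return "performance"
    if gradient <= min_gradient:
        return "gradient"
    # Upper-limit criteria: training ran out of iterations or failed.
    if epoch >= max_epochs:
        return "epochs"
    if mu >= max_mu:
        return "mu"
    if validation_failures >= max_failures:
        return "validation checks"
    # Elapsed time is tracked during training but never stops it.
    return None


print(termination_reason(16, 0.177, 5e-8, 1e-3, 0))    # gradient
print(termination_reason(1000, 85.5, 3.0, 1e-3, 0))    # epochs
```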

Figure 4 shows a typical MATLAB-generated plot of the actual disease numbers plotted
against the ANN derived disease numbers. When the plot shows a straight, positive,
one-to-one slope, the ANN derived disease number is close to the actual disease number.
The straight, one-to-one slope seen in the figure suggests that the network was able to
derive the appropriate disease number for each chemical.

Figure 4: Typical MATLAB Actual Disease versus ANN Derived Disease Outputs Plot

Figures 5 and 6 show the performance (i.e., mean squared error) of the ANN simulation
shown in Figure 3. The two plots look identical because the network used the mean squared
error to judge the performance of the network as it was trained. The lower the
performance value, or mean squared error value, the better the network is performing,
because there is less variation between the actual disease group numbers and the ANN
derived disease group numbers. The figures show the performance of the network for each
epoch during the simulation. During the simulation, the training performance decreases
until epoch 11, where the performance values plateau. In Figure 5, the validation and
test curves follow the same path as the training performance curve.


Figure 5: Typical MATLAB ANN Performance Plot

[Plot title: Best Training Performance is 0.17704 at epoch 16]

Figure 6: Typical MATLAB ANN Mean Squared Error Plot

Figure 7 shows a typical regression plot produced from an ANN simulation. Similar to
Figure 4, a positive, straight, one-to-one slope is desired to show that the ANN derived
disease numbers match the actual disease numbers. In this plot, the target (x-axis)
values represent the actual disease numbers and the output (y-axis) values represent the
ANN derived disease numbers. The training R-value shows how closely the derived numbers
compare to the actual numbers. An R-value of one would indicate a perfect match of the
derived to the actual disease number, so the higher the R-value, the more accurate the
ANN derived disease numbers are. The high R-value of 0.9985 for this simulation
corresponds to the straight, one-to-one slope seen in the graph.


Figure  7:  Typical  MATLAB  ANN  Regression  Plot 

Figure 8 shows how the gradient, mu, and number of validation checks values fluctuate
during the course of the network simulation. The decreasing gradient and mu values both
indicate the network was performing well. The validation checks remaining at zero also
indicate that the network training was performing well. In this simulation, from
Figure 3, the gradient parameter was the termination factor because it reached its lower
limit.




Figure 9: Typical MATLAB ANN Error Histogram Plot

After the ANN was shown to be capable of fitting chemical and disease input and output
data in the model, the TVT ratios and training functions were tested in the second phase
of simulations. Using the TVT ratio as an independent variable yields a wide range of
training performance parameter values. Table 5 shows the effects that different TVT
ratios can have on the network performance criteria. From the data collected, the
70-15-15 TVT ratio provided the best overall performance for the training of the network.
The higher average number of epochs associated with the 70-15-15 TVT ratio shows that
this ratio allowed the network more opportunity to improve with more simulation
iterations than the other ratios. While the 60-20-20 TVT ratio had a lower average
performance value, it had a higher average mu value and a higher average number of
validation checks, indicating that the 60-20-20 ratio did not adequately establish proper
weight values and could not continue to improve performance of the network. Although the
70-15-15 TVT ratio had the highest average gradient value, it had the lowest average mu
and lowest number of validation checks of the five TVT ratios. Table 5 shows that over
the five network training criteria, the 70-15-15 TVT ratio provides the best overall
performance with these data.

Table 5: The Effect of TVT Ratio on Network Training Statistics

                       Network Training Parameters
Ratio      Statistic   Epochs   Time     Performance   Gradient   Mu         Validation Checks
50-25-25   Minimum     3        37.75    3.929         1.174      2.00E-07   1
50-25-25   Maximum     8        208.05   92.510        642.100    2.00E+04   6
50-25-25   Average     4.4      90.07    37.277        192.282    2.91E+03   3
60-20-20   Minimum     4        69.48    0.380         0.213      6.02E-06   4
60-20-20   Maximum     12       219.66   96.450        993.201    2.40E+05   6
60-20-20   Average     7.8      146.72   20.792        310.266    6.92E+04   5.2
70-15-15   Minimum     10       154.85   0.177         0.000      8.20E-07   0
70-15-15   Maximum     14       256.19   90.037        2326.863   2.40E+02   0
70-15-15   Average     11.6     195.68   24.871        422.535    2.00E+01   0
80-10-10   Minimum     6        92.27    16.926        0.000      6.00E-04   0
80-10-10   Maximum     19       276.79   95.657        810.218    2.01E+04   0
80-10-10   Average     9.8      150.64   46.503        172.010    2.12E+03   0
90-5-5     Minimum     2        32.45    0.000         0.001      4.02E-04   0
90-5-5     Maximum     9        192.81   94.892        512.904    2.20E+05   6
90-5-5     Average     4.4      101.67   42.661        120.958    4.69E+03   3.4


Figure 10 shows the R²-values for the five different TVT ratios used in network
simulations. The 70-15-15 percent ratio clearly provides the best fit for the data with a
high R²-value of nearly one. The 80-10-10 percent ratio also performs well but has a lower
R²-value than the 70-15-15 percent ratio. The other ratios have low R²-values, and the
plots of the ANN derived disease numbers versus actual disease numbers show the
inaccuracy of those models. Individual plots for each of the TVT ratios used can be found
in Appendix D.

Figure 10: TVT Ratio Effect on the Coefficient of Determination

In addition to TVT ratios, altering the training functions used to train the network can
also affect network performance. Table 6 shows the effects that different training
functions have on the network performance criteria. When conducting simulations with the
different training functions, the 70-15-15 TVT ratio was used to organize the data as it
was determined to provide the best performance with the network. Explanations for all of
the training functions used can be found in Appendix E. From the data collected in the
simulations, the trainlm function provided the best overall training performance. Because
not all of the training functions used gradient, mu, or validation checks as performance
parameters, epochs, time, and performance were the main parameters used to compare the
different functions. The maximum number of epochs terminated the simulations for 10 of
the 15 training functions used in the network. While the trainlm training function did
not have the lowest performance value, its simulations were terminated due to the
gradient reaching the lower limit, indicating high network performance. Training function
traingscg had a lower performance parameter but also had a lower R-value when the actual
and ANN derived disease numbers were plotted. It should also be noted that training
functions with shorter network run times generally performed worse than functions with
longer run times.




Table 6: The Effects of Training Functions on Network Training Statistics

                     Performance Parameters
Function  Statistic  Epochs  Time    Performance  Gradient  Mu      Validation Checks
trainb    Minimum    1000    32.94   85.55        -         -       0
          Maximum    1000    47.24   102.43       -         -       0
          Average    1000    38.27   93.48        -         -       0
trainc    Minimum    1000    47.95   65.06        -         -       -
          Maximum    1000    306.49  101.05       -         -       -
          Average    1000    214.66  79.73        -         -       -
traincgb  Minimum    3       0.18    64.27        2.68      -       0
          Maximum    13      0.22    88.41        416.71    -       0
          Average    9       0.20    71.86        260.10    -       0
traincgf  Minimum    101     0.75    40.37        1.07      -       0
          Maximum    659     4.90    88.26        457.27    -       0
          Average    290     2.17    47.37        72.29     -       0
traincgp  Minimum    5       0.26    55.00        1.68      -       0
          Maximum    161     1.62    89.20        361.27    -       0
          Average    88      0.94    59.85        38.10     -       0
traingd   Minimum    1000    2.85    48.12        0.93      -       0
          Maximum    1000    2.86    100.66       229.21    -       0
          Average    1000    2.86    61.70        2.30      -       0
traingda  Minimum    1000    2.99    68.21        195.53    -       0
          Maximum    1000    3.30    146.91       408.98    -       0
          Average    1000    3.18    73.10        329.39    -       0
traingdm  Minimum    1000    3.11    48.10        0.94      -       0
          Maximum    1000    3.56    104.42       576.00    -       0
          Average    1000    3.27    84.38        2.55      -       0
traingdx  Minimum    1000    3.25    78.80        3.17      -       0
          Maximum    1000    3.27    95.30        602.00    -       0
          Average    1000    3.59    80.53        133.58    -       0
trainlm   Minimum    10      154.85  0.18         0.00      0.00    0
          Maximum    14      256.19  91.97        2143.91   400.00  0
          Average    13      219.36  27.91        394.93    33.25   0
trainoss  Minimum    1000    6.35    0.34         0.00      -       0
          Maximum    1000    10.19   99.07        550.00    -       0
          Average    1000    7.65    36.37        2.49      -       0
trainr    Minimum    1000    52.66   64.00        -         -       -
          Maximum    1000    87.99   101.85       -         -       -
          Average    1000    64.48   79.43        -         -       -
trainrp   Minimum    45      0.27    42.00        0.00      -       0
          Maximum    49      0.35    99.69        524.00    -       0
          Average    47      0.31    47.37        5.53      -       0
trains    Minimum    1000    21.25   83.90        -         -       -
          Maximum    1000    25.84   100.76       -         -       -
          Average    1000    23.35   91.88        -         -       -
trainscg  Minimum    1000    4.60    0.44         0.12      -       0
          Maximum    1000    7.86    95.46        545.60    -       0
          Average    1000    5.87    25.54        9.57      -       0
Figure 11 shows the R²-values for each of the training functions used in the
network simulations. Several of the training functions had good R²-values, above 0.7, but
the majority fell around or below 0.5, indicating that most training functions did not
fit the data well. The trainlm function had the highest R²-value of 0.999. The trainrp
function also had an R²-value of 0.999, but the slope of its line was 0.5:1, not 1:1. The
trainscg function had the second highest R²-value of 0.9769, but its ANN derived disease
number versus actual disease number plot was not as linear as that of trainlm. Plots for
all of the training functions can be found in the Appendix.
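
The difference between a high R²-value and a one-to-one slope can be illustrated with a short calculation. The following is a minimal Python sketch and is not part of the thesis MATLAB code. The function name fit_summary and the disease numbers are hypothetical, and the sketch assumes the R²-value is the squared correlation between actual and ANN derived disease numbers.

```python
from statistics import mean


def fit_summary(actual, derived):
    """Return (slope, r_squared) for the least-squares line derived = a + b * actual."""
    mean_actual, mean_derived = mean(actual), mean(derived)
    sxx = sum((a - mean_actual) ** 2 for a in actual)
    syy = sum((d - mean_derived) ** 2 for d in derived)
    sxy = sum((a - mean_actual) * (d - mean_derived) for a, d in zip(actual, derived))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Hypothetical disease numbers for illustration only (not thesis data).
actual = [1, 5, 9, 15, 16, 27]
one_to_one = [float(a) for a in actual]   # derived values equal the actual values
half_slope = [0.5 * a for a in actual]    # derived values are half the actual values

print("one-to-one:", fit_summary(actual, one_to_one))
print("half slope:", fit_summary(actual, half_slope))
```

Both hypothetical series are perfectly linear in the actual values, so they share the same R²-value, yet only the first reproduces the actual disease numbers. This is why the slope of the plot is examined alongside the R²-value.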


Figure  11:  Training  Function  Effect  on  the  Coefficient  of  Determination 




ANN  Model  Performance  for  Curated  Chemicals 


Utilizing curated chemicals from the CTD allowed various network models to be
created and tested with different TVT ratios. Because the actual disease number is known
for curated chemicals, the ANN derived disease numbers can easily be compared to the
actual values to determine how well the network is performing.

Effect  of  Training  Ratio  on  Model-Predicted  Disease 

Figure 12 shows typical actual disease number versus ANN derived disease
number plots for each of the five TVT ratios used. For the simulations shown, default
training functions and parameters were used in the network. The graph also includes a
one-to-one slope line so that the different simulation results can easily be compared to
the desired values. As discussed earlier, the 70-15-15 percent TVT ratio generated the best
network performance. Comparing the different TVT ratio plots, it is evident that the other
ratios do not produce the same performance as the 70-15-15 percent ratio and are unable to
accurately generate ANN derived disease numbers.
Figure 12: Effect of TVT Ratio on MATLAB ANN Derived Disease


Figure 12 also highlights three structurally different chemicals used in the
network simulations, showing how their ANN derived disease numbers compare to their
actual disease numbers when using the 70-15-15 percent TVT ratio. Two important
details about the network can be seen when examining the three highlighted chemicals.
First, the network is able to take inputs from different chemicals and correctly match them
to a single disease category. Potassium permanganate and ethylene glycol can both be
correctly linked to disease category 5, and ethylene glycol and candoxatril can both be
correctly linked to disease category 16. Second, in addition to matching multiple chemicals
to one disease category, the network also correctly took inputs from a single chemical and
linked it to the multiple diseases to which it is related. Both ethylene glycol and
candoxatril are shown to correctly have associations with multiple disease categories.

Chemical  Trends  for  Undertrained  Model  Simulations 

Figure 13 shows the effects of using an undertrained network on the ANN derived
disease number, using the average of five trials of simulation data. Nearly all of the ANN
derived disease numbers fall below the one-to-one slope line, which corresponds to the low
R²-value seen in Figure 4.8. For every disease associated with each of the three
chemicals, the ANN derived disease number was lower than the actual value. The
network was unable to derive the same disease number for multiple chemicals that were
linked to the same disease. When the network did not have enough data available to
train with, its performance suffered and it produced inaccurate results.
Figure 13: Effect of Undertrained TVT Ratio (50-25-25 %) on MATLAB ANN Derived Disease


Figure 13 also shows the three chemicals highlighted in Figure 12 to display how
the network derives disease numbers for them when not provided with enough training
data. When examining the three chemicals' ANN derived disease numbers, it is evident
that the undertrained network is not able to predict the correct values. When more than
one chemical is linked to the same disease, the network is unable to predict the same
disease for the multiple chemicals. Potassium permanganate and ethylene glycol should
both be related to disease category 5, but the network predicts values near 3.5 and 1.5,
respectively. Not providing enough data for the network to train with negatively affects
the overall performance of the network.

Occasionally, less than 0.5% of the time, the network will generate a disease
value close to the actual value, but this does not occur on a consistent basis, nor is it
consistent for a certain chemical or disease. For example, in one simulation, the network
generated a disease number of 14.2692 for formaldehyde. Formaldehyde is related to
disease category 15 (neoplasms), so this is a difference of 4.9%. Formaldehyde is also
related to 23 other disease groups, and the closest network generated disease number
among those 23 was 25.6% off the actual number. Out of the five simulations run
with the network, this was also the only time a number within 5% of 15 was generated for
formaldehyde.
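
The 4.9% figure quoted for formaldehyde can be reproduced with a one-line calculation. The following is a minimal Python sketch and is not part of the thesis code. The helper name percent_difference is hypothetical, and the sketch assumes the difference is expressed relative to the actual disease number.

```python
def percent_difference(actual, derived):
    """Absolute difference between the two values as a percentage of the actual value."""
    return abs(actual - derived) / actual * 100.0


# Formaldehyde: actual disease category 15, ANN derived disease number 14.2692.
formaldehyde = percent_difference(15, 14.2692)
print(f"{formaldehyde:.1f}%")  # prints 4.9%
```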
Chemical Trends for Overtrained Model Simulations


Figure 14 shows the effects of using an overtrained network on the ANN derived
disease number. The results shown in Figure 14 were obtained using a TVT ratio of
90-5-5 percent and the default network training settings. Similar to the results seen in
Figure 13, the network was unable to generate correct disease numbers, and nearly all of
the generated values fell below the one-to-one slope line. The overtrained network saw
similar failures when choosing disease numbers for multiple chemicals associated with
one disease. Potassium permanganate and ethylene glycol should have had a disease
number of five generated by the network, but instead values of 2.75 and 1.75 were
generated for the two chemicals, respectively.
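
The shortfalls reported for the undertrained and overtrained networks can be placed side by side with a short calculation. The following is a minimal Python sketch and is not part of the thesis code. It uses only the values quoted in the text (the 50-25-25 values are approximate, since the text gives them as "near" 3.5 and 1.5), and the names shortfalls, ACTUAL_DISEASE_NUMBER, and DERIVED are hypothetical.

```python
# Disease category both chemicals should have been assigned.
ACTUAL_DISEASE_NUMBER = 5

# ANN derived disease numbers quoted in the text for each TVT ratio.
DERIVED = {
    "50-25-25": {"potassium permanganate": 3.5, "ethylene glycol": 1.5},
    "90-5-5": {"potassium permanganate": 2.75, "ethylene glycol": 1.75},
}


def shortfalls(actual, derived):
    """Return actual minus derived for every TVT ratio and chemical."""
    return {
        ratio: {name: actual - value for name, value in chemicals.items()}
        for ratio, chemicals in derived.items()
    }


for ratio, chemicals in shortfalls(ACTUAL_DISEASE_NUMBER, DERIVED).items():
    for name, gap in chemicals.items():
        share = gap / ACTUAL_DISEASE_NUMBER
        print(f"{ratio:<9} {name:<23} below actual by {gap:.2f} ({share:.0%})")
```

In both cases every derived value sits below the actual disease number, which matches the description of points falling below the one-to-one slope line.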
Figure 14: Effect of Overtrained TVT Ratio (90-5-5 %) on MATLAB ANN Derived Disease


ANN  Model  Performance  for  Uncurated  Chemicals 


Three uncurated chemicals from the CTD were selected to be tested by the ANN
using a 70-15-15 TVT ratio and the trainlm training function. While these chemicals
were uncurated in the database, there is research to suggest that they are related to certain
diseases. Testing the network with these chemicals makes it possible to determine how
well the model can predict chemical-disease associations.

Figure 15 shows the network results when the uncurated chemical cystaphos was
input into the network. From the 10 inputs, the network generated seven possible disease
output numbers. Of the seven predicted outputs, neoplasms had literature from previous
research to support the network prediction. For
example, the Defense Threat Reduction Agency (DTRA) in 2006 conducted a review of
previous Soviet Union research involving biological actions of neutron radiobiology.
DTRA discovered several trials where animals were injected with cystaphos or cystaphos
mixed with other chemicals. These trials included results that showed cystaphos was
capable of protecting the intestinal system in mice from unwanted radiation damage
(Defense Threat Reduction Agency, 2006). In addition to the DTRA report, Barkaia et
al. conducted research in 1989 involving cystaphos as an adjuvant in cancer treatment.
Using mice, guinea pigs, and monkeys, they injected the animals with cystaphos combined
with sodium nitrite and mexamine after irradiating them with Cs-137 gamma rays.
Barkaia et al. then monitored the radiation sickness exhibited by the animals and repeated
the cystaphos solution injections to see if the radiation sickness lessened. From their
experiments, they found that with repeated injections, the cystaphos solution helped to
protect healthy cells in the bone marrow, spleen, and intestine of the test animals (Barkaia
et al., 1989). The findings of Barkaia et al. mirror those mentioned in the DTRA report
in that cystaphos appears to reduce the harmful effects of radiation and protect the
intestinal system in mice. The results presented from these two studies support the
network model prediction that cystaphos is linked to neoplasms. This identified potential
link does not indicate that cystaphos causes cancer, but rather that it is connected to it in
ways that are not fully understood. The peer-reviewed literature does not contain studies
that have examined the connection between cystaphos and any of the other predicted diseases.
Figure 15: MATLAB ANN Derived Diseases for NCC Cystaphos

Figure 16 shows the network results when the uncurated chemical
3,5-dibromo-2-(2,4-dibromophenoxy)phenol (6-HO-BDE-47) was input into the network. From
the 10 inputs, the network generated nine possible disease output numbers, with nervous
system diseases being supported by previous research. Hendriks et al. (2010) studied the
use of polybrominated diphenyl ethers (PBDEs) to stimulate nicotinic acetylcholine (nACh)
and GABA(A) receptors on neurons in the brain. Hendriks et al. took several PBDEs,
including 6-HO-BDE-47, and conducted tests investigating their effects on the nACh and
GABA(A) receptors. Their findings documented that 6-HO-BDE-47 was an antagonist to
the nACh receptors yet acted as an agonist to GABA(A) receptors. The results from the work
of Hendriks et al., showing that 6-HO-BDE-47 can be linked to nervous system diseases,
support the same prediction generated by the ANN model. In addition to being
related to nervous system diseases, there is also literature to support a possible connection
to endocrine diseases. The network model did not predict endocrine diseases, but an
investigation conducted by Cao et al. (2010) indicates that this may be a possibility. Cao et
al. took PBDEs that were known to cause thyroid hormone disruption and tested to see if
they bind to hormone transport proteins. Their results showed that 6-HO-BDE-47 had an
affinity for binding with the thyroid hormone transport protein, which could cause
endocrine system problems. Although the network prediction of nervous system diseases
parallels the findings of Hendriks et al., the model is only able to identify a potential link
between 6-HO-BDE-47 and a disease category. It does not indicate that 6-HO-BDE-47
directly causes these diseases. The peer-reviewed literature does not contain studies that
have examined the connection between 6-HO-BDE-47 and any of the other predicted
diseases.
Figure 16: MATLAB ANN Derived Diseases for NCC 6-HO-BDE-47

Figure 17 shows the network results when the uncurated chemical
4,4’-diiodobiphenyl (DIB) was input into the network. Seven output predictions were
generated from the 10 inputs, and endocrine system diseases had literature to support the
network’s prediction. Yomada-Okabe et al. (2005) published research findings suggesting
DIB affected thyroid hormone receptors by inhibiting gene expression. From their work,
Yomada-Okabe et al. concluded that DIB affects the luciferase gene by enhancing the
expression of it. Mediation of the luciferase gene has been documented to act as an
endocrine disrupter in animals and humans. This relationship indicates that DIB could be
a potential source of endocrine disease as predicted by the model. The peer-reviewed
literature does not contain studies that have examined the connection between DIB and
any of the other predicted diseases.
Figure 17: MATLAB ANN Derived Diseases for NCC 4,4’-diiodobiphenyl
VI. Conclusions and Recommendations


Research  Conclusions 

The  conclusions  from  this  research  are  as  follows: 

1) A new chemical classification system was created to identify chemicals with a
unique number based on structural characteristics of the chemical. This new
system facilitates the analysis of relationships between chemicals and
diseases. While the use of molecular weight, hydrogen acceptors, and
hydrogen donors proved sufficient for creating a classification system that
can be used in a predictive ANN, the ANN model results do not prove that
these three variables provide the best performance for predicting
chemical-disease associations. Other chemical characteristics may provide
equal or better results.

2) Artificial neural networks were successfully employed to associate chemicals
and diseases. Initial simulations with TVT ratios of 80-10-10 percent
produced coefficients of determination equal to 0.99. The ANN derived
diseases were predicted using inputs that were formatted according to the new
chemical classification system.

3) The TVT ratio of 70-15-15 percent provided the best network performance
when compared to other TVT ratios. When compared to the other ratios
tested in the network, the 70-15-15 percent TVT ratio had the lowest performance
values, or lowest error values, for ratios that produced networks with zero
validation checks. The lack of validation checks shows that the ANN was
able to properly train on the data. Additionally, the 70-15-15 percent ratio had
the highest R² value (0.99) of the TVT ratios, indicating it has a greater
likelihood of accurately predicting disease output values.

4) The trainlm (Levenberg-Marquardt backpropagation) function provided the
best network performance. The trainlm function had an R²-value of 0.99 and a
one-to-one slope on the actual disease number versus ANN derived disease
number plot. The trainrp and trainscg functions also produced good results with
R²-values of 0.99 and 0.976, respectively; however, the trainrp function results
did not follow a one-to-one linear slope and the trainscg function did not
produce linear results. The high R² value indicates that the trainlm function
has a greater chance of correctly predicting disease output values.

5) The ANN has potential to predict chemical-disease associations that are not
yet curated. Cystaphos was associated with neoplasms, and two independent
literature sources supported the ANN prediction. The ANN predicted that
3,5-dibromo-2-(2,4-dibromophenoxy)phenol (6-HO-BDE-47) was associated
with nervous system diseases, and a published study supported this
finding. A separate literature source concluded that 6-HO-BDE-47 was also
linked to endocrine diseases, but the ANN failed to make that connection.
4,4’-diiodobiphenyl (DIB) was correctly matched to endocrine disease, a match
supported by an independent research report. The ANN has demonstrated the
potential to predict disease associations for new chemicals and to guide
research for existing chemicals that require toxicological testing.
Recommendations for Future Research


Future  research  should  carry  out  the  following  activities: 

1)  Laboratory  testing  of  chemical-disease  associations  that  are  predicted  by  the 
ANN  model  presented  in  this  thesis,  followed  by  possible  refinements  or 
modifications  to  the  new  chemical  classification  system.  This  should  be 
carried  out  for  both  curated  and  uncurated  chemicals.  One  possibility  would 
be  to  analyze  chemicals  that  were  grandfathered  into  the  Toxic  Substances 
Control  Act  inventory  whose  chemical-disease  associations  are  unknown. 

2)  Developing  an  ANN  that  correlates  chemical-gene  expression  associations  in 
order  to  develop  a  tool  that  provides  insight  into  the  mechanisms  that  cause 
(or  prevent)  disease. 

3) Testing of additional factors in the ANN to produce a more accurate model for
correlating chemicals, genes, and diseases. Utilizing a loop function to
iteratively step through every possible TVT ratio combination may discover a
more optimal ratio to use in the network. Additionally, testing transfer
functions within the ANN hidden layer or adding additional hidden layers has
the potential to show increased network predictive performance as well.
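
The loop over TVT ratio combinations suggested in the third recommendation could start from an enumeration such as the one below. This is a minimal Python sketch and is not part of the thesis MATLAB code. The function name tvt_ratios, the 5 percent step, and the 5 percent minimum share per subset are assumptions made for illustration, and the network training that would run inside the loop is deliberately left out.

```python
def tvt_ratios(step=5, min_share=5):
    """Yield (train, validation, test) percentages that sum to 100.

    step is the grid spacing in percent and min_share is the smallest
    percentage any one subset may receive. Both are assumed values.
    """
    for train in range(min_share, 100, step):
        for val in range(min_share, 100 - train, step):
            test = 100 - train - val
            if test >= min_share:
                yield train, val, test


candidates = list(tvt_ratios())
print(len(candidates), "candidate ratios on a 5 percent grid")
print((70, 15, 15) in candidates)  # the best-performing ratio in this work is on the grid
```

Each candidate would then be used to train the network several times, with the resulting performance and R² values recorded for comparison. Note that this grid also contains ratios whose validation and test shares differ, which the five ratios tested in this work did not include.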
Appendix A: MATLAB ANN Code


 1  % This code is used for the thesis work conducted by Capt Brouch
 2  % This code relates chemical input data to species and disease output data
 3  % This code is based on the sample input-output fitting network in the MATLAB ANN Guide
 4  % The script was last revised on 22 Jan 2014 by Capt Brouch
 5  % The format for the input matrix is: ['Molecular Weight' 'Hydrogen Acceptors' 'Hydrogen Donors']
 6  % The format for the output matrix is: ['Species' 'Dummy Variable' 'Disease']
 7  % The format for the input 1 matrix is: ['Molecular Weight' 'Hydrogen Acceptors' 'Hydrogen Donors']
 8  % The input 1 matrix contains uncurated chemical data used to predict disease outputs
 9
10  tstart = clock;
11
12  A = zeros(1173,3);
13  B = zeros(1173,3);
14  C = zeros(1173,3);
15
16  A(:,1) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Input-Output Tables','b2:b1174');
17  A(:,2) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Input-Output Tables','c2:c1174');
18  A(:,3) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Input-Output Tables','d2:d1174');
19  B(:,1) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Input-Output Tables','e2:e1174');
20  B(:,3) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Input-Output Tables','g2:g1174');
21  C(:,1) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Uncurated','c3:c1175');
22  C(:,2) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Uncurated','d3:d1175');
23  C(:,3) = xlsread('J:\Brouch\Brouch Thesis\Excel Files\Brouch Thesis ANN Data.xls','Uncurated','e3:e1175');
24
25  inputs = A;
26  targets = B;
27  inputs1 = C;
28  % the targets matrix is the same as the output matrix
29
30  % preallocate the plotting matrix (PM)
31  PM = zeros(1173,5);
32
33  % the variable of interest is vv - this is the column number in the target matrix
34  vv = 3;
35
36  countt = 1;
37
38  for count = 1:1:5
39  % Create a Fitting Network
40  hiddenLayerSize = 1;
41  net = fitnet(hiddenLayerSize);
42  % Set up Division of Data for Training, Validation, Testing
43  net.divideParam.trainRatio = 70/100;
44  net.divideParam.valRatio = 15/100;
45  net.divideParam.testRatio = 15/100;
46  % Train the Network
47  [net,tr] = trainlm(net, inputs, targets);
48  % Test the Network
49  outputs = net(inputs);
50  errors = gsubtract(outputs, targets);
51  performance = perform(net,targets,outputs);
52  % View the Network
53  % view(net)
54  % plotperf(tr)
55
56  Outputs = net(inputs);
57  % trOut = Outputs(tr.trainInd);
58  % vOut = Outputs(tr.valInd);
59  % tsOut = Outputs(tr.testInd);
60  % trTarg = Outputs(tr.trainInd);
61  % vTarg = targets(tr.valInd);
62  % tsTarg = targets(tr.testInd);
63  % figure(98)
64  % plotregression(trTarg,trOut,'Train',vTarg,vOut,'Validation',tsTarg,tsOut,'Testing');
65
66  PM(:,countt) = Outputs(:,vv);
67  countt = countt + 1;
68  end
69
70  figure(1)
71  %plot(targets(:,vv),PM(:,1),'ro')
72  %plot(targets(:,vv),PM(:,1),'ro',targets(:,vv),PM(:,2),'go',targets(:,vv),PM(:,3),'ko',targets(:,vv),PM(:,4),'m-',targets(:,vv),PM(:,5),'k-.')
73  title('The relationship between actual and predicted diseases, version 1')
74  xlabel('Actual Disease Number')
75  ylabel('ANN-Derived Disease Number')
76
77  tstop = clock;
78  runtime = etime(tstop,tstart)/60;
79  disp('length of run in minutes:')
80  disp(runtime)
This MATLAB code was used to set up and run the various ANN simulations
discussed in the paper. The first eight lines of code provide some background
information about what the code is being used for and how the input and
output data are organized. Line 10 begins a clock to track how long the ANN simulations
take to complete. Lines 12-14 create matrices for the input and output data obtained from
Microsoft Excel spreadsheets. Lines 16-23 dictate where MATLAB will find the data
needed for these matrices. Lines 16-18 and 19-20 state the data needed for
the input and output matrices, respectively. Lines 21-23 identify the uncurated chemical
data used to generate the ANN derived disease predictions. Lines 25-27 assign the
matrices created in lines 12-14 to the inputs, targets, and inputs1 variables. Line 31
creates the plotting matrix where the ANN derived disease values will be saved. In this
code, the matrix is created with five columns so that five simulations can be run back to
back. Line 34 designates the variable of interest to be saved in the plotting matrix. For
this code, the variable of interest is the third column in the target matrix: disease.
Lines 38-68 control and run the actual ANN. Line 38 establishes how many simulations the
network will run. For this code, five simulations are run to match the number of columns
in the plotting matrix. Line 40 controls the size of the hidden layer (the number of hidden
neurons) the network uses. Lines 43-45 control how the network splits up the data
according to the TVT ratio being used in the simulation. Line 47 designates the training
function being used in the network. Line 56 establishes what input data the network will
use to generate the ANN derived disease outputs. Lines 66-67 save the ANN derived disease
values from each simulation in the plotting matrix and advance the column counter, and
line 68 ends the loop once the number of simulations established in line 38 has been run.
Lines 70-75 generate the actual disease number versus ANN derived disease number plot
once the network simulations are complete. Lines 77-80 stop the clock that was started in
line 10 and record the total time it took the network to run all of the simulations.

For  the  simulations  testing  the  different  TVT  ratios  and  training  functions,  only 
lines  16-20  were  used  to  obtain  the  input  and  output  data.  The  input  data  for  uncurated 
chemicals  was  not  used  in  these  simulations.  When  adjusting  the  TVT  ratios  and  training 
functions,  lines  43-45  and  47  were  the  only  lines  of  code  that  required  editing. 
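
The effect of the ratios set in lines 43-45 can be illustrated by dividing a set of record indices at random. The following is a minimal Python sketch and is not the MATLAB toolbox's own division routine. The function name split_indices and the fixed seed are assumptions, and the sketch assumes each of the 1173 rows of the input matrix is treated as one record.

```python
import random


def split_indices(n_records, train=0.70, val=0.15, seed=0):
    """Randomly divide record indices into training, validation, and testing subsets.

    The testing subset receives whatever remains after the training and
    validation subsets are filled, so the three subsets always cover all records.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    n_train = round(n_records * train)
    n_val = round(n_records * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


train_idx, val_idx, test_idx = split_indices(1173)
print(len(train_idx), len(val_idx), len(test_idx))  # 821 176 176
```

Changing the train and val arguments plays the same role as editing the trainRatio, valRatio, and testRatio values in the MATLAB listing.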

When the uncurated chemical data was used to generate disease value predictions,
the code was first run using the original input and output data to establish the network
using the curated data. Then the command “outputs=net(inputs1)” was entered into the
MATLAB command window. This told the network to use the new inputs containing the
uncurated chemical data to derive disease value predictions.
Appendix B: Input and Output Matrices


Chemical            Molecular  Hydrogen   Hydrogen  Species  Dummy Variable  Disease
                    Weight     Acceptors  Donors    Number   Column          Number
Acetone             58.08
                    58.08
                    58.08
Aciclovir           225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
                    225.21
Alprazolam          308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
                    308.77
Ammonium Sulfate    132.14
                    132.14
                    132.14
                    180.16


180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

180.16 

4 

1 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

266.34 

5 

4 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

749.00 

14 

5 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

78.12 

0 

0 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

334.40 

6 

2 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

194.19 

6 

0 

515.65 

8 

2 

515.65 

8 

2 

515.65 

8 

2 

515.65 

8 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

236.28 

3 

2 

40.00 

1 

1 

40.00 

1 

1 

40.00 

1 

1 

40.00 

1 

1 

40.00 

1 

1 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

323.14 

7 

3 

252.34 

6 

3 

1 

0 

2 

252.34 

6 

3 

1 

0 

3 

252.34 

6 

3 

1 

0 

4 

252.34 

6 

3 

1 

0 

5 

252.34 

6 

3 

1 

0 

7 

252.34 

6 

3 

1 

0 

8 

252.34 

6 

3 

1 

0 

9 

252.34 

6 

3 

1 

0 

11 

252.34 

6 

3 

1 

0 

12 

252.34 

6 

3 

1 

0 

13 

Cimetidine 

252.34 

252.34 

6 

3 

1 

0 

14 

6 

3 

1 

0 

15 

252.34 

6 

3 

1 

0 

16 

252.34 

6 

3 

1 

0 

17 

252.34 

6 

3 

1 

0 

19 

252.34 

6 

3 

1 

0 

21 

252.34 

6 

3 

1 

0 

22 

252.34 

6 

3 

1 

0 

23 

252.34 

6 

3 

1 

0 

24 

252.34 

6 

3 

1 

0 

25 

252.34 

6 

3 

1 

0 

26 

252.34 

6 

3 

1 

0 

27 

230.10 

3 

2 

1 

0 

3 

230.10 

3 

2 

1 

0 

4 

230.10 

3 

2 

1 

0 

5 

230.10 

3 

2 

1 

0 

7 

230.10 

3 

2 

1 

0 

8 

230.10 

3 

2 

1 

0 

9 

230.10 

3 

2 

1 

0 

11 

230.10 

3 

2 

1 

0 

12 

Clonidine 

230.10 

230.10 

3 

2 

1 

0 

13 

3 

2 

1 

0 

14 

230.10 

3 

2 

1 

0 

16 

230.10 

3 

2 

1 

0 

17 

230.10 

3 

2 

1 

0 

19 

230.10 

3 

2 

1 

0 

21 

230.10 

3 

2 

1 

0 

22 

230.10 

3 

2 

1 

0 

23 

230.10 

3 

2 

1 

0 

24 

230.10 

3 

2 

1 

25 

80 


Copper  Sulfate 

159.61 

159.61 

159.61 

159.61 

159.61 

159.61 

159.61 

159.61 

159.61 

159.61 

159.61 

159.61 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

4 

5 

9 

12 

13 

14 

16 

17 

19 

21 

22 

25 

1202.64 

23 

5 

9 

0 

1 

1202.64 

23 

5 

1 

0 

2 

1202.64 

23 

5 

1 

0 

3 

1202.64 

23 

5 

1 

0 

4 

1202.64 

23 

5 

1 

0 

5 

1202.64 

23 

5 

1 

0 

7 

1202.64 

23 

5 

1 

0 

8 

1202.64 

23 

5 

1 

0 

9 

1202.64 

23 

5 

1 

0 

10 

1202.64 

23 

5 

1 

0 

11 

1202.64 

23 

5 

1 

0 

12 

1202.64 

23 

5 

1 

0 

13 

Cyclosporine 

1202.64 

23 

5 

1 

0 

14 

1202.64 

23 

5 

1 

0 

15 

1202.64 

23 

5 

1 

0 

16 

1202.64 

23 

5 

1 

0 

17 

1202.64 

23 

5 

1 

0 

19 

1202.64 

23 

5 

1 

0 

20 

1202.64 

23 

5 

1 

0 

21 

1202.64 

23 

5 

1 

0 

22 

1202.64 

23 

5 

1 

0 

23 

1202.64 

23 

5 

1 

0 

23 

1202.64 

23 

5 

1 

0 

25 

1202.64 

23 

5 

1 

0 

26 
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266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

266.39 

2 

1 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

392.47 

5 

3 

284.75 
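The layout of the matrices in this appendix, one row per chemical-disease association with the chemical properties repeated on every row, can be sketched as follows. This is a minimal Python illustration, not the code used in this work. The function name expand_rows is hypothetical, and the split of the columns into inputs (molecular weight, hydrogen acceptors, hydrogen donors, dummy variable) and outputs (species number, disease number) is an assumption made for the example. The copper sulfate values are those listed in this appendix.

```python
def expand_rows(mol_weight, acceptors, donors, species, dummy, disease_numbers):
    """Return one (input_row, output_row) pair per chemical-disease association."""
    rows = []
    for disease in disease_numbers:
        # Assumed input columns: the chemical properties plus the dummy variable.
        input_row = [mol_weight, acceptors, donors, dummy]
        # Assumed output columns: the species number and the disease number.
        output_row = [species, disease]
        rows.append((input_row, output_row))
    return rows


# Copper sulfate as listed in this appendix: 12 disease associations.
copper_sulfate = expand_rows(
    159.61, 4, 0, 1, 0,
    [4, 5, 9, 12, 13, 14, 16, 17, 19, 21, 22, 25],
)
inputs = [pair[0] for pair in copper_sulfate]
outputs = [pair[1] for pair in copper_sulfate]
```

Repeating the chemical properties on each row is what allows a single chemical to be associated with many disease numbers while keeping the input and output matrices the same length.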
Appendix C: Additional MATLAB Training Sessions


Figures C.1-C.7 show the ANN simulation results for a network using the
training function trainscg. The trainscg training function is similar to the default function
shown in Chapter 4, but it uses a scaled conjugate gradient to train the
network. Trainscg also does not use mu as a network parameter to measure how well the
network is performing during training. This session was terminated because the upper
limit of epochs was reached. Even though the network reached the maximum number
of epochs, it still performed fairly well with an R-value of 0.93898. Figure C.2 shows
that the actual versus ANN derived disease plot appeared to follow a straight line but did
have some curve to it. This curve in the output data would explain the R-value being less
than one. The performance and MSE plots (Figures C.3 and C.4) look identical because
MSE is used with trainscg to measure the network performance. It should also be noted
that the network was consistently able to lower the performance/MSE over the course of
the simulation, indicating the network was able to continually improve over time. Figure
C.5 shows the regression plot, which makes it clear that the ANN derived disease values
did not lie directly along the one-to-one slope. Figure C.6 shows numerous fluctuations
in the gradient as the network tried to improve its performance, and Figure C.7 shows
the error histogram, highlighting that most of the error in the network was small, but not
small enough to produce a linear output.
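The two measures discussed in this session, the R-value and the MSE, can be computed directly from paired actual and ANN derived values. The Python sketch below is illustrative only: the function names and the sample values are hypothetical and are not taken from the simulations. It also includes a deliberately curved set of predictions to show how curvature alone lowers the R-value below one.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between paired values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_value(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical values, not data from this work.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.8, 5.1]
curved = [a ** 2 / 5 for a in actual]  # monotonic but not a straight line

mse_predicted = mean_squared_error(actual, predicted)
r_predicted = r_value(actual, predicted)
r_curved = r_value(actual, curved)
```

Because the R-value measures how closely the points follow a straight line, the curved series produces a value below one even though it rises steadily with the actual values.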
Figure C.1: Trainscg Training Session

Figure C.2: Trainscg Actual Disease versus ANN Derived Disease Outputs Plot (plot title: "The relationship between actual and predicted diseases, version 1")

Figure C.3: Trainscg Performance Plot

Figure C.4: Trainscg Mean Squared Error Plot (plot title: "Best Training Performance is 7.89G3 at epoch 1000"; x-axis: 1000 Epochs)

Figure C.5: Trainscg Regression Plot (plot title: "Training: R=0.93898")

Figure C.6: Trainscg Training States

Figure C.7: Trainscg Error Histogram


Figures C.8-C.14 show the ANN simulation results for a network using the
training function trainrp. The trainrp training function is similar to the default function
shown in Chapter 4, but it uses resilient backpropagation to train the network.
Trainrp also does not use mu as a network parameter to measure how well the network is
performing during training. This session was terminated because the gradient reached its
lower limit, which suggested the network had potentially performed well. However, when
reviewing Figure C.9 and the R-value, it is apparent that the ANN derived disease values
do not follow a one-to-one slope. In addition to the R-value of 0.53903, the regression plot
in Figure C.12 shows the ANN derived disease values follow a slope of 0.5 to 1, which
indicates the network predicted values half those of the actual values. The performance
and MSE plots (Figures C.10 and C.11) look identical because MSE is used with trainrp to
measure the network performance. Figure C.13 shows a consistently decreasing gradient,
and Figure C.14 shows the error histogram, highlighting that although a small amount of
error was spread over a wide range, the majority of the error was small.
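The slope read from a regression plot can be obtained with an ordinary least-squares line through the paired actual and ANN derived values. The Python sketch below is a minimal illustration with hypothetical values: the function name fit_line and the sample data are assumptions, and the data are constructed so that every prediction is exactly half the actual value.

```python
def fit_line(actual, predicted):
    """Least-squares slope and intercept of predicted values against actual values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sxx = sum((a - mean_a) ** 2 for a in actual)
    sxy = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_a
    return slope, intercept


# Hypothetical case: the network predicts half of each actual value.
actual = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted = [0.5 * a for a in actual]
slope, intercept = fit_line(actual, predicted)
```

A slope of 0.5 with an intercept near zero corresponds to predictions that are half the actual values. The slope and the R-value are separate diagnostics: the slope describes the scale of the predictions, while the R-value describes how tightly the points follow the fitted line.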
Figure C.8: Trainrp Training Session

Figure C.9: Trainrp Actual Disease versus ANN Derived Disease Outputs Plot (plot title: "The relationship between actual and predicted diseases, version 1")

Figure C.10: Trainrp Performance Plot

Figure C.11: Trainrp Mean Squared Error Plot (plot title: "Best Training Performance is 41.9879 at epoch 45"; x-axis: 45 Epochs)

Figure C.13: Trainrp Training Parameters

Figure C.14: Trainrp Error Histogram
Appendix D: TVT Graphs


The following figures show the results of the five ANN simulations for each
training-validation-testing (TVT) ratio. The 70-15-15 percent and 80-10-10 percent ratios
were the only ones to produce linear plots where all of the ANN derived values were
positive, and the 70-15-15 percent ratio was the only one to produce ANN derived values
on the one-to-one slope for all five simulations. The default train training function was
used for all of the TVT simulations.
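A TVT ratio such as 70-15-15 percent amounts to dividing the rows of the input and output matrices into three non-overlapping groups. The Python sketch below shows one way to do this with a random shuffle. It is an assumption-laden illustration, not a reproduction of the MATLAB toolbox's own data division: the function name split_tvt, the fixed seed, and the choice to give any rounding remainder to the testing group are all hypothetical.

```python
import random


def split_tvt(n_samples, train=0.70, val=0.15, test=0.15, seed=0):
    """Randomly divide sample indices into training, validation, and testing groups."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("the three ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train * n_samples)
    n_val = round(val * n_samples)
    # Whatever remains after rounding goes to the testing group.
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


# Example: 100 rows divided 70-15-15.
train_idx, val_idx, test_idx = split_tvt(100)
```

Changing the three ratio arguments gives the other divisions compared in this appendix, for example split_tvt(100, 0.50, 0.25, 0.25) for the 50-25-25 percent case.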
Figure D.1: 50-25-25% TVT Ratio Plot

Figure D.2: 60-20-20% TVT Ratio Plot

Figure D.3: 70-15-15% TVT Ratio Plot

Figure D.4: 80-10-10% TVT Ratio Plot

Figure D.5: 90-5-5% TVT Ratio Plot
Appendix E: List of Training Functions


Table E.1 shows each of the training functions used in the ANN simulations along
with a brief description of how each function trains the network. All of the
descriptions were found in the MATLAB Neural Network Toolbox Reference Guide.
Table E.1: Training Function Descriptions

Training Function  Description
train              Default training function used in the MATLAB ANN. Usually defaults to the trainlm function.
trainb             Trains a network using batch training by updating weight and bias learning rules
trainc             Trains a network in cyclical order using weight and bias learning rules
traincgb           Trains a network using conjugate gradient backpropagation in conjunction with Powell-Beale restarts to update weight and bias values
traincgf           Trains a network using conjugate gradient backpropagation in conjunction with Fletcher-Reeves updates to update weight and bias values
traincgp           Trains a network using conjugate gradient backpropagation in conjunction with Polak-Ribiere updates to update weight and bias values
traingd            Trains a network using gradient descent backpropagation to update weight and bias values
traingda           Trains a network using gradient descent with adaptive learning rate backpropagation to update weight and bias values
traingdm           Trains a network using gradient descent with momentum backpropagation to update weight and bias values
traingdx           Trains a network using gradient descent with momentum and adaptive learning rate backpropagation to update weight and bias values
trainlm            Trains a network using Levenberg-Marquardt backpropagation to update weight and bias values
trainoss           Trains a network using one-step secant backpropagation to update weight and bias values
trainr             Trains a network using random order incremental training with learning functions to update weight and bias values
trainrp            Trains a network using resilient backpropagation to update weight and bias values
trains             Trains a network using sequential order incremental training with learning functions to update weight and bias values
trainscg           Trains a network using scaled conjugate gradient backpropagation to update weight and bias values
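Several of the functions listed in this appendix differ mainly in how a weight is moved along the error gradient. As a rough illustration of the difference between plain gradient descent and gradient descent with momentum, the Python sketch below minimizes a one-variable error surface. It uses the common textbook form of the momentum update and is not claimed to match the exact update rules of the MATLAB traingd or traingdm functions. The function name, the error surface, and the parameter values are all hypothetical.

```python
def gradient_descent(grad, w0, lr=0.1, momentum=0.0, steps=300):
    """Minimize a one-variable function given its gradient.

    With momentum=0.0 this is plain gradient descent; a nonzero value
    carries part of the previous step into the next one.
    """
    w = w0
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w


# Hypothetical error surface (w - 3) ** 2, whose gradient is 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)


w_plain = gradient_descent(grad, w0=0.0)
w_momentum = gradient_descent(grad, w0=0.0, momentum=0.9)
```

Both runs settle near the minimum at w = 3. The momentum run overshoots and oscillates before settling, which is the behavior the momentum term introduces.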
Appendix F: Training Function Plots


Figures F.1-F.15 show the actual disease values versus the ANN
derived disease values for the 15 training functions tested in the ANN. Each plot shows
the three simulations run per training function, as well as a solid black line representing
the desired one-to-one slope. The one-to-one slope line makes it easier to distinguish
whether individual trials, and training functions as a whole, were able to generate disease
values similar to the actual values.
Figure F.1: Trainb Function Disease Plot

Figure F.2: Trainc Function Disease Plot

Figure F.3: Traincgb Function Disease Plot

Figure F.4: Traincgf Function Disease Plot

Figure F.5: Traincgp Function Disease Plot

Figure F.6: Traingd Function Disease Plot

Figure F.7: Traingda Function Disease Plot

Figure F.8: Traingdm Function Disease Plot

Figure F.9: Traingdx Function Disease Plot (legend: Trial 1, Trial 2, Trial 3)

Figure F.10: Trainlm Function Disease Plot

Figure F.11: Trainoss Function Disease Plot

Figure F.12: Trainr Function Disease Plot

Figure F.13: Trainrp Function Disease Plot

Figure F.14: Trains Function Disease Plot

Figure F.15: Trainscg Function Disease Plot
Appendix G: Uncurated Chemical Data


Table G.1 contains the input data for each of the three uncurated chemicals used
to test the predictive ability of the ANN. The ANN derived species and disease outputs are
also shown, with the disease outputs rounded to the nearest whole number. The rounding
allowed the predictions to be matched against the values used for the disease groups.
The bolded rounded disease numbers and disease category outputs are those that have
literature supporting the network's predictions.
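The rounding step can be sketched in a few lines. The Python example below is illustrative only: the function name to_disease_number and the sample outputs are hypothetical, and returning None for a value that falls outside the 27 disease categories is an assumption, since the handling of out-of-range outputs is not described here. Halves are rounded up explicitly because Python's built-in round sends exact halves to the nearest even number.

```python
import math


def to_disease_number(value, n_categories=27):
    """Round an ANN derived disease value to the nearest whole number.

    Returns None when the rounded value is not one of the category
    numbers 1..n_categories (an assumption made for this sketch).
    """
    rounded = math.floor(value + 0.5)  # round half up for positive values
    return rounded if 1 <= rounded <= n_categories else None


# Hypothetical ANN derived outputs, not values from Table G.1.
ann_outputs = [12.4, 12.5, 0.2, 27.4, 27.6]
disease_numbers = [to_disease_number(v) for v in ann_outputs]
```

Each rounded value that falls within 1 to 27 can then be looked up against the disease category it represents.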
Table G.1: Uncurated Chemical Input, Output, and ANN Derived Disease Number Data